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Abstract 

We show an efficient algorithm for the following problem: Given uniformly random 
points from an arbitrary n-dimensional simplex, estimate the simplex. The size of 
the sample and the number of arithmetic operations of our algorithm are polynomial 
in n. This answers a question of Frieze, Jerrum and Kannan [5]. Our result can 
also be interpreted as efficiently learning the intersection of n + 1 half-spaces in W 1 
in the model where the intersection is bounded and we are given polynomially many 
uniform samples from it. Our proof uses the local search technique from Independent 
Component Analysis (ICA), also used by [5]. Unlike these previous algorithms, which 
were based on analyzing the fourth moment, ours is based on the third moment. 

1 Introduction 

We are given uniformly random samples from an unknown convex body in MJ 1 , how many 
samples are needed to approximately reconstruct the body? It seems intuitively clear, at least 
for n = 2,3, that if we are given sufficiently many such samples then we can reconstruct (or 
learn) the body with very little error. For general n, it is known to require samples [B] 

(see also [TT] for a similar lower bound in a different but related model of learning). This is 
an information-theoretic lower bound and no computational considerations are involved. As 
mentioned in [6], it turns out that if the body has few facets (e.g. polynomial in n), then 
polynomial in n samples are sufficient for approximate reconstruction. This is an information- 
theoretic upper bound and no efficient algorithms (i.e., with running time poly(n)) are 
known. (We remark that to our knowledge the same situation holds for polytopes with 
pofy(n) vertices.) In this paper we study the reconstruction problem for the special case 
when the input bodies are restricted to be (full-dimensional) simplices. We show that in 
this case one can in fact learn the body efficiently. More precisely, the algorithm knows that 
the input body is a simplex but only up to an affine transformation, and the problem is to 
recover this affine transformation. This answers a question of Frieze et al. Section 6]. 

The problem of learning a simplex is also closely related to the well-studied problem of 
learning intersections of half-spaces. Suppose that the intersection of n + 1 half-spaces in R™ 
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is bounded, and we are given poly(n) uniformly random samples from it. Then our learning 
simplices result directly implies that we can learn the n + 1 half-spaces. This also has the 
advantage of being a proper learning algorithm, meaning that the output of the algorithm 
is a set of n + 1 half-spaces, unlike many of the previous algorithms. 

Previous work. Perhaps the first approach to learning simplices that comes to mind is 
to find a minimum volume simplex containing the samples. This can be shown to be a good 
approximation to the original simplex. (Such minimum volume estimators have been studied 
in machine learning literature, see e.g. [TH] for the problem of estimating the support of a 
probability distribution. We are not aware of any technique that applies to our situation 
and provides theoretical guarantees.) However, the problem of finding a minimum volume 
simplex is in general NP-hard [IS] . This hardness is not directly applicable for our problem 
because our input is a random sample and not a general point set. Nevertheless, we do 
not have an algorithm for directly finding a minimum volume simplex; instead we use ideas 
similar to those used in Independent Component Analysis (ICA). ICA studies the following 
model: Given a sample from an affine transformation of a random vector with independently 
distributed coordinates, recover the affine transformation. Frieze et al. [5] gave an efficient 
algorithm for this problem (with some restrictions on the allowed distributions, but also 
with some weaker requirements than full independence) along with most of the details of a 
rigorous analysis (a complete analysis of a special case can be found in Arora et al. [3J ; see 
also Vempala and Xiao [23] for a generalization of ICA to subspaces along with a rigorous 
analysis). The problem of learning parallelopipeds from uniformly random samples is a 
special case of this problem. They [5] asked if one could learn other convex bodies, and 
in particular simplices, efficiently from uniformly random samples. Nguyen and Regev [IT] 
gave a simpler and rigorous algorithm and analysis for the case of learning parallelopipeds 
with similarities to the popular FastICA algorithm of Hyvarinen [H] . The algorithm in [T7] 
is a first order algorithm unlike Frieze et al.'s second order algorithm. 

The algorithms in both [51 |T7] make use of the fourth moment function of the probability 
distribution. Briefly, the fourth moment in direction u £ R" is E(u- A) 4 , where X £ R n is the 
random variable distributed according to the input distribution. The moment function can 
be estimated from the samples. The independent components of the distribution correspond 
to local maxima or minima of the moment function, and can be approximately found by 
finding the local maxima/minima of the moment function estimated from the sample. 

More information on ICA including historical remarks can be found in [10J H]. Ideas 
similar to ICA have been used in statistics in the context of projection pursuit since the 
mid-seventies. ICA cannot be applied to the simplex learning problem directly as there is no 
clear independence among the components. Let us note that Frieze et al. [5] allow certain 
kinds of dependences among the components, however it does not appear to be useful for 
learning simplices. 

Learning intersections of half-spaces is a well-studied problem in learning theory. The 
problem of PAC-learning intersections of even two half-spaces is open, and there is evidence 
that it is hard at least for sufficiently large number of half-spaces: E.g., Klivans and Sher- 
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stov [13] prove that learning intersections of n e half-spaces in MJ 1 (for constant e > 0) is hard 
under standard cryptographic assumptions (PAC-learning is possible, however, if one also 
has access to a membership oracle in addition to random samples [H]). Because of this, 
much effort has been expended on learning when the distribution of random samples is some 
simple distribution, see e.g. [121 [22j IZE] and references therein. This line of work makes 
substantial progress towards the goal of learning intersections of k half-spaces efficiently, 
however it falls short of being able to do this in time polynomial in k and n; in particular, 
these algorithms do not seem to be able to learn simplices. The distribution of samples in 
these works is either the Gaussian distribution or the uniform distribution over a ball. Frieze 
et al. [5] and Goyal and Rademacher [6] consider the uniform distribution over the inter- 
section. Note that this requires that the intersection be bounded. Note also that one only 
gets positive samples in this case unlike other work on learning intersections of half-spaces. 
The problem of learning convex bodies can also be thought of as learning a distribution or 
density estimation problem for a special class of distributions. 

Gravin et al. [7J show how to reconstruct a polytope with N vertices in M n , given its first 
0{nN) moments in (n + 1) random directions. In our setting, where we have access to only 
a polynomial number of random samples, it's not clear how to compute moments of such 
high orders to the accuracy required for the algorithm of [7J even for simplices. 

See [2] for a recent review of applications of the method of moments in other learning 
problems. 

Our results For clarity of the presentation, we use the following machine model for the 
running time: a random access machine that allows the following exact arithmetic operations 
over real numbers in constant time: addition, substraction, multiplication, division and 
square root. 

Theorem 1. There is an algorithm (Algorithm^ below) such that given access to random 
samples from a simplex S input Q with probability at least 1 — 5 over the sample and the 
randomness of the algorithm, it outputs n + 1 vectors that are the vertices of a simplex S so 
that dxviS, S input) < e- The algorithm runs in time polynomial in n, 1/e and 1/5. 

As mentioned earlier, our algorithm uses ideas from ICA. Our algorithm uses the third 
moment instead of the fourth moment used in certain versions of ICA. The third moment is 
not useful for learning symmetric bodies such as the cube as it is identically 0. It is however 
useful for learning a simplex where it provides useful information, and is easier to handle 
than the fourth moment. 

The probability of success of the algorithm can be "boosted" so that the dependence of 
the running time on 5 is only linear in log(l/5) as follows: The following discussion uses the 
space of simplices with total variation distance as the underlying metric space. Let e be the 
target distance. Take an algorithm that succeeds with probability 5/6 and error parameter 
e' to be fixed later (such as Algorithm [1] with 5 = 1/6). Run the algorithm t = O (log 1/5) 
times to get t simplices. By a Chernoff-type argument, at least 2t/3 simplices are within e' 
of the input simplex with probability at least 1 — 5/2. 
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By sampling, we can estimate the distances between all pairs of simplices with additive 
error less than e'/lO in time polynomial in t, 1/e' and log 1/5 so that all estimates are correct 
with probability at least 1 — 5/2. For every output simplex, compute the number of output 
simplices within estimated distance (2 + l/10)e'. With probability at least 1 — 5 both of the 
desirable events happen, and then necessarily there is at least one output simplex, call it S, 
that has 2t/3 output simplices within estimated distance (2 + l/10)e'. Any such S must be 
within (3 + 2/10)e' of the input simplex. Thus, set e' = e/(3 + 2/10). 

Idea of the algorithm. Many of the details of our proofs follow an outline similar to 
previous work. The main new idea is that, after putting the samples in a suitable position 
(see below), the third moment of the sample can be used to recover the simplex using a 
simple FastlCA-like algorithm. We outline our algorithm next. 

As any full-dimensional simplex can be mapped to any other full-dimensional simplex 
by an invertible affine transformation, it is enough to determine the translation and linear 
transformation that would take the given simplex to some canonical simplex. As is well- 
known for ICA-like problems (see, e.g., [5]), this transformation can be determined up to a 
rotation from the mean and the covariance matrix of the uniform distribution on the given 
simplex. The mean and the covariance matrix can be estimated efficiently from a sample. A 
convenient choice of an n-dimensional simplex is the convex hull of the canonical vectors in 
M n+1 . We denote this simplex A n and call it the standard simplex. So, the algorithm begins 
by picking an arbitrary invertible affine transformation T that maps M. n onto the hyperplane 
{x G M n+1 : 1 • x = 1}. We use a T so that T~ 1 (A n ) is an isotropic^] simplex. In this 
case, the algorithm brings the sample set into isotropic position and embeds it in IR n+1 using 
T. After applying these transformations we may assume (at the cost of small errors in the 
final result) that our sample set is obtained by sampling from an unknown rotation of the 
standard simplex that leaves the all-ones vector (denoted 1 from now on) invariant (thus 
this rotation keeps the center of mass of the standard simplex fixed), and the problem is to 
recover this rotation. 

To find the rotation, the algorithm will find the vertices of the rotated simplex approxi- 
mately. This can be done efficiently because of the following characterization of the vertices: 
Project the vertices of the simplex onto the hyperplane through the origin orthogonal to 1 
and normalize the resulting vectors. Let V denote this set of n + 1 points. Consider the 
problem of maximizing the 3rd moment of the uniform distribution in the simplex along 
unit vectors orthogonal to 1. Then V is the complete set of local maxima and the complete 
set of global maxima (Theorem [6]). A fixed point-like iteration (inspired by the analysis of 
gradient descent in [IT]) starting from a random point in the unit sphere finds a local max- 
imum efficiently with high probability. By the analysis of the coupon collector's problem, 
O(nlogn) repetitions are highly likely to find all local maxima. 

Idea of the analysis. In the analysis, we first argue that after putting the sample in 
isotropic position and mapping it through T, it is enough to analyze the algorithm in the 

x See Section [2 
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case where the sample comes from a simplex S that is close to a simplex S' that is the 
result of applying a rotation leaving 1 invariant to the standard simplex. The closeness here 
depends on the accuracy of the sample covariance and mean as an estimate of the input 
simplex's covariance matrix and mean. A sample of size 0(n) guarantees ((TJ Theorem 
4.1], j20t Corollary 1.2]) that the covariance and mean are close enough so that the uniform 
distributions on S and 5" are close in total variation. We show that the subroutine that finds 
the vertices (Subroutine [T]), succeeds with some probability when given a sample from 5". By 
definition of total variation distance, Subroutine [T] succeeds with almost as large probability 
when given a sample from S (an argument already used in [IT]). As an additional simplifying 
assumption, it is enough to analyze the algorithm (Algorithm [T|) in the case where the input 
is isotropic, as the output distribution of the algorithm is equivariant with respect to affine 
invertible transformations as a function of the input distribution. 

Organization of the paper. Starting with some preliminaries in Sec. |2l we state some 
results on the third moment of simplices in Sec. |3j In Sec. H] we give an algorithm that 
estimates individual vertices of simplices in a special position; using this algorithm as a 
subroutine in Sec. Owe give the algorithm for the general case. Sec. El characterizes the set 
of local maxima of the third moment. 



2 Preliminaries 

An n-simplex is the convex hull of n + 1 points in W 1 . We will be interested in the non- 
degenerate case where these n+1 points do not lie on an (n— l)-dimensional affine hyperplane. 
It will be convenient to work with the standard n-simplex A n living in M n+1 defined as the 
convex hull of the n + 1 canonical unit vectors ei, . . . , e n+ i, that is 

A n = {(x , . . . ,x n ) e R n+1 |x + • • • + x n = 1 

and Xi > for all i}. 

The canonical simplex Q n living in MJ 1 is given by 

{(x , . . . , x n _i) G W 1 \x H h x n _i < 1 

and Xi > for all i}. 

Note that A n is the facet of Q n+1 opposite to the origin. 
Let B n denote the n-dimensional Euclidean ball. 

The complete homogeneous symmetric polynomial of degree d in variables Uq, . . . , u n , 
denoted h n (uo, . . . , u n ), is the sum of all monomials of degree d in the variables: 

h d {u ,...,u n ) = u o--- u n n 

fc H hk n =d 
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Also define the d-ih power sum as 

p d (u ,...,u n ) =u d Q + . . . + u d n . 
For a vector u = (m , Ui, . . . , u n ), we define 

(2) / 2 2 2\ 

u y > = • • • ,u n ). 

Vector 1 denotes the all ones vector (the dimension of the vector will be clear from the 
context). 

A random vector X G M n is isotropic if E(X) = and E(JX T ) = I. A compact set in 
M. n is isotropic if a uniformly distributed random vector in it is isotropic. The inradius of an 
isotropic n-simplex is \/ (n + 2)/n, the circumradius is ^/ n(n + 2). 

The total variation distance between two probability measures is oItv{^i v ) — su Pa 1/^(^4) — 
v{A)\ for measurable A. For two compact sets K,L C ]R n , we define the total variation 
distance oItv{K,L) as the total variation distance between the corresponding uniform dis- 
tributions on each set. It can be expressed as 

( ^' L) ifvolL>volK. 
This identity implies the following elementary estimate: 

Lemma 2. Let K, L be two compact sets in W 1 . Let < a < 1 < such that aK CIC f3K. 
Then d TV (K,L) < 2 (1 - (a/(3) n ). 

Proof. We have dTv{ a K, f3K) = 1 — (a//3) n . Triangle inequality implies the desired inequal- 
ity. □ 

Lemma 3. Consider the coupon collector's problem with n coupons where every coupon 
occurs with probability at least a. Let 5 > 0. Then with probability at least 1 — 5 all coupons 
are collected after a _1 (logn + log 1/5) trials. 

Proof. The probability that a particular coupon is not collected after that many trials is at 
most 

(1 — a ) Q_1 ( 1 °g n+lo s 1 / <5 ) < e - 1 °g n - 1 °g 1 /' 5 = fi/ n _ 
The union bound over all coupons implies the claim. □ 

3 Computing the moments of a simplex 

The k-th moment m^u) over A n is the function 

u^E XeAn ((u-X) k ). 



6 



In this section we present a formula for the moment over A n . Similar more general formulas 
appear in [TH]. We will use the following result from [5] for > 0: 



T tto,.. r anJ T oc .---a n \ 



/n» +1 ° " (n + l + Ei«i)!' 

From the above we can easily derive a formula for integration over A 7 

a ! • • -a n \ 



\ Xq° ■ ■ -x° n dx = VnTT ■ 
Ja" 



Now 



A n n (* + £,«i)l' 

(x u + . . . + x n u n ) k dx 



fcoH hfcn=fe 

E 

fcoH hfcn=A: 



fc \ kn k y/n + 1 ■ k l . . . k n l 



k l,...,k n \J '" " (n + E**i)« 



_ fclx/^+T v fc0 fcn 

1 , , V 

(n + k)\ 

The variant of Newton's identities for the complete homogeneous symmetric polynomial 
gives the following relations which can also be verified easily by direct computation: 

3h 3 (u) = h 2 (u)p 1 (u) + h 1 (u)p 2 {u) + p 3 {u), 

2h 2 (u) = /ii(u)pi(u) + p 2 (u) = pi(u) 2 + p 2 (u). 

Divide the above integral by the volume of the standard simplex |A n | = \fn +T/n! to 
get the moment: 



3!\M + 1 
m ^ u > = ( n + 3)\ fe 3W/|A w | 



2(h 2 (u)p 1 (u) + h 1 {u)p 2 {u) + p 3 Q)) 

(n + l)(n + 2)(ra + 3) 
(pi(w) 3 + 3pi(m)p 2 (m) + 2p 3 (M)) 
(n+l)(ra + 2)(ra + 3) 
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4 Subroutine for finding the vertices of a rotated stan- 
dard simplex 

In this section we solve the following simpler problem: Suppose we have poly(n) samples 
from a rotated copy S of the standard simplex, where the rotation is such that it leaves 1 
invariant. The problem is to approximately estimate the vertices of the rotated simplex from 
the samples. 

We will analyze our algorithm in the coordinate system in which the input simplex is the 
standard simplex. This is only for convenience in the analysis and the algorithm itself does 
not know this coordinate system. 

As we noted in the introduction, our algorithm is inspired by the algorithm of Nguyen 
and Regev [17] for the related problem of learning hypercubes and also by the FastICA 
algorithm [9]. New ideas are needed for our algorithm for learning simplices; in particular, 
our update rule is different. With the right update rule in hand the analysis turns out to be 
quite similar to the one in [XT] . 

We want to find local maxima of the sample third moment. A natural approach to do this 
would be to use gradient descent or Newton's method (this was done in [5]). Our algorithm, 
which only uses first order information, can be thought of as a fixed point algorithm leading 
to a particularly simple analysis and fast convergence. Before stating our algorithm we 
describe the update rule we use. 

We will use the abbreviation C n = (n + l)(n + 2){n + 3)/6. Then, from the expression 
for m 3 {u) we get 

Vm 3 (u) = — (3 Pl (n) 2 l + 3p 2 (u)l + 6 Pl (u)u + 6w (2) ) . 
Solving for we get 

u {2) = CnVm 3 (u) - ^>iO) 2 l - -p 2 (w)l -pi(u)u 

= C n Vm 3 (u) - -(u ■ 1) 2 1 - ~(u • ufl - (u ■ l)u. (1) 

While the above expressions are in the coordinate system where the input simplex is 
the canonical simplex, the important point is that all terms in the last expression can be 
computed in any coordinate system that is obtained by a rotation leaving 1 invariant. Thus, 
we can compute as well independently of what coordinate system we are working in. This 
immediately gives us the algorithm below. We denote by rh 3 (u) the sample third moment, 
i.e., m 3 (u) = |^ =1 (ii • rj) 3 for t samples. This is a polynomial in u, and the gradient is 
computed in the obvious way. Moreover, the gradient of the sample moment is clearly an 
unbiased estimator of the gradient of the moment; a bound on the deviation is given in the 
analysis (Lemma H]). For each evaluation of the gradient of the sample moment, we use a 
fresh sample. 

It may seem a bit alarming that the fixed point-like iteration is squaring the coordinates 
of m, leading to an extremely fast growth (see Equation [1] and Subroutine 1). But, as in 



8 



other algorithms having quadratic convergence like certain versions of Newton's method, 
the convergence is very fast and the number of iterations is small. We show below that it 
is 0(log(n/S)), leading to a growth of u that is polynomial in n and 1/5. The boosting 
argument described in the introduction makes the final overall dependence in 5 to be only 
linear in log (1/5). 

We state the following subroutine for M n instead of M n+1 (thus it is learning a rotated 
copy of A 71 " 1 instead of A n ). This is for notational convenience so that we work with n 
instead of n + 1 . 



Subroutine 1 Find one vertex of a rotation of the standard simplex A n 1 via a fixed point 
iteration-like algorithm 

Input: Samples from a rotated copy of the n-dimensional standard simplex (for a rotation 

that leaves 1 invariant). 

Output: An approximation to a uniformly random vertex of the input simplex. 

Pick u(l) G S^ 1 , uniformly at random, 
for % — 1 to r do 

u(i + 1) :=C n _iVm 3 (n(z)) - -(u(i) ■ 1) 2 1 

- l -(u{i)-u{i)) 2 t-{u(i)-l)u{i). 

Normalize u(i + 1) by dividing by \\u(i + 1)|| 2 - 
end for 

Output u(r + 1). 



Lemma 4. Let c > be a constant, n > 20, and < 5 < 1. Suppose that Subroutine 1 
uses a sample of size t = 2 17 n 2c+22 (|) 2 In ^p 1 for each evaluation of the gradient and runs 
for r = log 4 ( c + 3 )^" lnn iterations. Then with probability at least 1 — 5 Subroutine 1 outputs a 
vector within distance l/n c from a vertex of the input simplex. With respect of the process 
of picking a sample and running the algorithm, each vertex is equally likely to be the nearest. 

Note that if we condition on the sample, different vertices are not equally likely over the 
randomness of the algorithm. That is, if we try to find all vertices running the algorithm 
multiple times on a fixed sample, different vertices will be found with different likelihoods. 

Proof. Our analysis has the same outline as that of Nguyen and Regev [17] . This is because 
the iteration that we get is the same as that of [17] except that cubing is replaced by squaring 
(see below); however some details in our proof are different. In the proof below, several of 
the inequalities are quite loose and are so chosen to make the computations simpler. 

We first prove the lemma assuming that the gradient computations are exact and then 
show how to handle samples. We will carry out the analysis in the coordinate system where 
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the given simplex is the standard simplex. This is only for the purpose of the analysis, and 
this coordinate system is not known to the algorithm. Clearly, u(i + 1) = (u(i) 2 , . . . , w(z)^). 
It follows that, 

u(i + l) = («(l)f,...,«(l)J). 

Now, since we choose u(l) randomly, with probability at least (1 — [n 2 — n)8') one of the 
coordinates of u(l) is greater than all the other coordinates in absolute value by a factor of at 
least (1+5'), where < 5' < 1. (A similar argument is made in [T7] with different parameters. 
We briefly indicate the proof for our case: The probability that the event in question does 
not happen is less than the probability that there are two coordinates u(l) a and such 
that their absolute values are within factor 1 + 5', i.e. 1/(1 + 5') < \u(l) a \/\u(l)b\ < 1 + 5'. 
The probability that for given a, b this event happens can be seen as the Gaussian area of the 
four sectors (corresponding to the four choices of signs of u(l) a , w(l)&) in the plane each with 
angle less than 25'. By symmetry, the Gaussian volume of these sectors is 25' /{n/ 2) < 26'. 
The probability that such a pair (a, b) exists is less than 2(^)5'.) Assuming this happens, 
then after r iterations, the ratio between the largest coordinate (in absolute value) and the 
absolute value of any other coordinate is at least (1 + 5') 2r . Thus, one of the coordinates is 
very close to 1 and others are very close to 0, and so «(r + l) is very close to a vertex of the 
input simplex. 

Now we drop the assumption that the gradient is known exactly. For each evaluation 
of the gradient we use a fresh subset of samples of t points. Here t is chosen so that each 
evaluation of the gradient is within ^-distance 1 / n Cl from its true value with probability at 
least 1 — 5", where c\ will be set at the end of the proof. An application of the Chernoff 
bound yields that we can take t = 200n 2ci+4 In 2 j£-; we omit the details. Thus all the r 
evaluations of the gradient are within distance 1 / n Cl from their true values with probability 
at least 1 — r8". 

We assumed that our starting vector u(l) has a coordinate greater than every other 
coordinate by a factor of (1 + 5') in absolute value; let us assume without loss of generality 
that this is the first coordinate. Hence |w(l)i| > \j\fn. When expressing in terms of 



the gradient, the gradient gets multiplied by C„_i < n 3 (we are assuming n > 20), keeping 
this in mind and letting C2 = c\ — 3 we get for j ^ 1 

\u(i + l)i| > u(i)l - l/n c * > u{i)\{\ - n~^-^) 



\u(i + ~ u(i) 2 - + l/n c 2 ~ u(i) 2 + l/n C2 
If u(i) 2 > l/n C2_C3 , where 1 < C3 < C2 — 2 will be determined later, then we get 



|«(z + l)i|/|u(i+l)j| > 



l-l/r^ 2 " 1 (u(i)^ 2 



l + l/n c s \u(i)j 

>o-iK-) 2 f4%y. (2) 



U[l) 
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Else, 



u(i+ l) 1 |/|u(i+ > 



1/n — 1/ n C2 



l/ n C 2 - C3 _|_ l/ n c 2 




c 2 — c 3 — 1 



where we used c 3 > 1 and n > 20 in the last inequality. 
We choose c 3 so that 



(i--L) 2 (i + 5')>(i + ^/2) 



(3) 



For this, 5' > 32/n C3 or equivalently C3 > (In (32/<5'))/hm suffices. 

For c 3 satisfying flSJ we have (1 - ^) 2 (1 + 5') 2 > (! + *')■ It; tnen follows from (|2J) 
that the first coordinate continues to remain the largest in absolute value by a factor of at 



Now if r is such that (1 — l/n C:i ) 2r+1 2 (1 + 5') 2 ' > ^n C2 c:i 1 , we will be guaranteed that 
\u(r + l)i|/|n(r + > |?2 C2_C3_1 . This condition is satisfied if we have (1 — l/ra C3 ) 2r+1 (l + 
S') 2r > In 02-03-1 , or equivalently ((l-l/n C3 ) 2 (l+<5 / )) 2 '' > |n C2-C3_1 . Now using © it suffices 
to choose r so that (1 + 5' /2) 2T > |n C2_C3_1 . Thus we can take r = log(4(c2 — c^){\n.n) / &). 

Hence we get \u(r + l)i|/|u(r + > |?t, C2 ~ C3_1 . It follows that for u(r + 1), the £2- 
distance from the vertex (1, 0, ... , 0) is at most 8/n C2_C3_2 < l/n C2_C3_3 for n > 20; we omit 
easy details. 

Now we set our parameters: c 3 = l + (ln(32/<5')/hin) and c 2 — c 3 — 3 = c and c\ = c 2 + 3 = 
7+c+ln(32/5')/lnn satisfies all the constraints we imposed on Ci, c 2 , c 3 . Choosing 5" = S'/r, 
we get that the procedure succeeds with probability at least 1 — (n 2 — n)5' — r5" > 1 — n 2 5'. 
Now setting 5' = 5/n 2 gives the overall probability of error 5, and the number of samples 
and iterations as claimed in the lemma. □ 

5 Learning simplices 

In this section we give our algorithm for learning general simplices, which uses the algorithm 
from the previous section as a subroutine. The learning algorithm uses an affine map T : 



least (1 + 5') after each iteration. Also, once we have > \n' 

|u(i')i|/|«(«')jl > in C2_C3_1 for a11 *' > i 
(J2J) gives that after r iterations we have 



, we have 
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]R n — > M. n+1 that maps some isotropic simplex to the standard simplex. We describe now a 
way of constructing such a map: Let A be a matrix having as columns an orthonormal basis 
of I 1 - in R n+1 . To compute one such A, one can start with the (n + l)-by-(n + 1) matrix B 
that has ones in the diagonal and first column, everything else is zero. Let QR = B be a 
QR-decomposition of B. By definition we have that the first column of Q is parallel to 1 and 
the rest of the columns span J -1 . Given this, let A be the matrix formed by all columns of 
Q except the first. We have that the set {A T ei} is the set of vertices of a regular n-simplex. 
Each vertex is at distance 



from the origin, while an isotropic simplex has vertices at distance \Jn{n + 2) from the 
origin. So an affine transformation that maps an isotropic simplex in W 1 to the standard 



To simplify the analysis, we pick a new sample r(l), . . . , r(t 3 ) to find every vertex, as this 
makes every vertex equally likely to be found when given a sample from an isotropic simplex. 
(The core of the analysis is done for an isotropic simplex; this is enough as the algorithm's 
first step is to find an affine transformation that puts the input simplex in approximately 
isotropic position. The fact that this approximation is close in total variation distance 
implies that it is enough to analyze the algorithm for the case of exact isotropic position, the 
analysis carries over to the approximate case with a small loss in the probability of success. 
See the proof below for the details.) A practical implementation may prefer to select one 
such sample outside of the for loop, and find all the vertices with just that sample — an 
analysis of this version would involve bounding the probability that each vertex is found 
(given the sample, over the choice of the starting point of gradient descent) and a variation 
of the coupon collector's problem with coupons that are not equally likely. 

Proof of Theorem H As a function of the input simplex, the distribution of the output of 
the algorithm is equivariant under invertible affine transformations. Namely, if we apply 
an affine transformation to the input simplex, the distribution of the output is equally 
transformed^ The notion of error, total variation distance, is also invariant under invertible 
affine transformations. Therefore, it is enough to analyze the algorithm when the input 
simplex is in isotropic position. In this case ||p(z)|| < n + 1 (see Section [2]) and we can set 
t\ < poly(n, 1/e', log(l/5)) so that < e' with probability at least 1 — 5/10 (by an easy 
application of Chernoff's bound), for some e' to be fixed later. Similarly, using results from 
[H Theorem 4.1], a choice of t± < ne'~ 2 polylog(l/e') polylog(l/<5) implies that the empirical 

2 To see this: the equivariance of the algorithm as a map between distributions is implied by the equivari- 
ance of the algorithm on any given input sample. Now, given the input sample, if we apply an affine trans- 
formation to it, this transformation is undone except possibly for a rotation by the step s(i) = B^ 1 (r(i) — fi). 
A rotation may remain because of the ambiguity in the characterization of B. But the steps of the algorithm 
that follow the definition of s(i) are equivariant under rotation, and the ambiguous rotation will be removed 
at the end when B is applied again in the last step. 
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Algorithm 1 Learning a simplex. 

Input: Error parameter e > 0. Probability of failure parameter 5 > 0. Oracle access to 
random points from some n- dimensional simplex Sin put- 

Output: V — {v(l), . . . , v(n + 1)} C IR n (approximations to the vertices of the simplex). 
Estimate the mean and covariance using t\ = poly(n, 1/e, 1/5) samples p(l), . . . ,p{ti)~- 

V = 

Compute a matrix B so that S = BB T (say, Cholesky decomposition). 
Let U = 0. 

for i = 1 to m (with m = poly(n, log 1/5)) do 

Get t 3 = poly(n, 1/e, log 1/5) samples r(l), . . . r(t 3 ) and use ^i,B to map them to 
samples s{%) from a nearly-isotropic simplex: s(z) = B~ x {r{i) — fi). 

Embed the resulting samples in M n+1 as a sample from an approximately rotated 
standard simplex: Let = T(s(i)). 

Invoke Subroutine [1] with sample 1(1), . . . , Z (t 3 ) to get u G IR n+1 . 

Let u be the nearest point to u in the affine hyperplane {x : x ■ 1 = 1}. If u is not 
within 1 / a/2 of a point in U, add tt to U. (Here 1 / v2 is half of the edge length of the 
standard simplex.) 
end for 
Let 

V = BT-\U)+n 

= ^(n + l)(n + 2)BA T [U 1 j + 
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second moment matrix 

i 

satisfies ||S — I\\ < e' with probability at least 1 — 5/10. We have S = X — fifi T and this 
implies ||S — I\\ < — 7|| + ||/U/i T || < 2e'. Now, s(l), . . . , s(t 3 ) is an iid sample from a 
simplex S' = B~ 1 (Sinput — /•*)■ Simplex S" is close in total variation distance to some 
isotropic simplex^] Si so- More precisely, Lemma below shows that 

d TV (S',S IS o)<12ne', (4) 

with probability at least 1 — 5/5. 

Assume for a moment that s(l), . . . , sfo) are from S/so- The analysis of Subroutine 
[TJ (fixed point-like iteration) given in Lemma S] would guarantee the following: Successive 
invocations to Subroutine [T] find approximations to vertices of T(Sjso) within Euclidean 
distance e" for some e" to be determined later and t% = poly (n, 1 / e", log 1/5) . We ask for each 
invocation to succeed with probability at least 1—5/ (20m) with m = n(log ra+log 20/5). Note 
that each vertex is equally likely to be found. The choice of m is so that, if all m invocations 
succeed (which happens with probability at least 1 — 5/20), then the analysis of the coupon 
collector's problem, Lemma [31 implies that we fail to find a vertex with probability at most 
5/20. Overall, we find all vertices with probability at least 1 — 5/10. 

But in reality samples s(l), . . . , s(t 3 ) are from S 1 , which is only close to Si so- The 
estimate from (jlj) with appropriate e' = poly(l/n, e", 5) gives 

d>Tv(S' , Siso) < tt:;: — > 
10 t 3 m 

which implies that the total variation distance between the joint distribution of all t^m 
samples used in the loop and the joint distribution of actual samples from the isotropic 
simplex Siso is & t most 5/10, and this implies that the loop finds approximations to all 
vertices of T(Siso) when given samples from 5" with probability at least 1 — 5/5. The points 
in U are still within Euclidean distance e" of corresponding vertices of T(Siso)- 

To conclude, we turn our estimate of distances between estimated and true vertices into 
a total variation estimate, and map it back to the input simplex. Let 5"' = conv T~ 1 U. 
As T maps an isotropic simplex to a standard simplex, we have that ^ [n + l){n + 2)T is 
an isometry, and therefore the vertices of S" are within distance e" / Win + l)(n + 2) of the 
corresponding vertices of Siso- Thus, the corresponding support functions are uniformly 
within 

e'" = e"/^(n + l)(n + 2) 
of each other on the unit sphere. This and the fact that Siso 3 B n imply 

(1 - e"')Siso ^ S" C(l + e'")Siso- 

3 The isotropic simplex Siso wm typically be far from the (isotropic) input simplex, because of the 
ambiguity up to orthogonal transformations in the characterization of B. 
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Thus, by Lemma El d TV {S" , S ISO ) < 1 - (tj^T < 1 - (1 - e"') 2n < 2ne"' < 2e" and this 



implies that the total variation distance between the uniform distributions on conv V and 
the input simplex is at most 2e". Over all random choices, this happens with probability at 



Lemma 5. Let Sinput be an n- dimensional isotropic simplex. Let £ be an n-by-n positive 
definite matrix such that ||S — 7|| < e < 1/2. Let fi be an n-dimensional vector such that 
\\fi\\ < e. Let B be an n-by-n matrix such that £ = BB T . Let S be the simplex B~ l (Sin put • — 
fi). Then there exists an isotropic simplex Sjso such that dTv{S, Si so) — 6ne. 

Proof. We use an argument along the lines of the orthogonal Procrustes problem (nearest 
orthogonal matrix to -B -1 , already in [T71 Proof of Theorem 4]): Let UDV T be the singular 
value decomposition of B~ l . Let R = UV T be an orthogonal matrix (that approximates 
B l ). Let Siso = RSinput- 

We have S = UDV T (Sinput — A*)- Let a m i n , a max be the minimum and maximum 
singular values of D, respectively. This implies: 



least 1 - 25/5. We set e" = e 



/2- 



□ 



o~minUV (Sinput — aO Q S C UV (Sinput - n), 

°~min(SlSO ~ Rfl>) Q S C 0-, rnax (Siso ~ RfJ>)- 



(5) 



As Siso B ni \\fi\\ < 1, R is orthogonal and Si so is convex, we have 



RfjLDil-MDSjso- 



Also, 



Siso 



R^ C Siso + ||A*||-B n 
C5 J50 (1 + |HI). 



This in (J^J) gives 



0"mm(l - \\n\\)Siso CSC a max (l + \\fj,\\)Siso- 



This and Lemma [2] imply 




The estimate on S gives a min > y/1 — e, <r max . < a/1 + e. Thus 




<2(l-(l-e) 3n ) 



C 6ne. 



□ 
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6 The local and global maxima of the 3rd moment of 
the standard simplex and the isotropic simplex 

In this section we study the structure of the set of local maxima of the third moment as 
a function of the direction (which happens to be essentially u t— > ^ uf as discussed in 
Section E]). This is not necessary for our algorithmic result, however it gives insight into the 
geometry of the third moment (the location of local maxima/minima and stationary points) 
and suggests that more direct optimization algorithms like gradient descent and Newton's 
method will also work, although we will not prove that. 

Theorem 6. Let K C W 1 be an isotropic simplex. Let X be random in K . Let V = 
{xj}^ 1 C M n be the set of normalized vertices of K . Then V is a complete set of local 
maxima and a complete set of global maxima of F : S^ 1 —> 1R given by F(u) = E((u • X) 3 ). 

Proof idea: Embed the simplex in Show that the third moment is proportional to the 

complete homogeneous symmetric polynomial of degree 3, which for the relevant directions 
is proportional to the sum of cubes. To conclude, use first and second order optimality 
conditions to characterize the set of local maxima. 

Proof. Consider the standard simplex 



and identify it with V via a linear map A : M. n+1 R n so that A(A n ) = V. Let Y 
be random in A n . Consider G : S n — > 1R given by G(v) = m^{v) = E((t> • Y) 3 ). Let 
U = {v G ]R n+1 :t>-l = 0,||v|| = l}be the equivalent feasible set for the embedded 
problem. We have G(v) = cF(Av) for any v £ U and some constant c > independent of 
v. To get the theorem, it is enough to show that the local maxima of G in U are precisely 
the normalized versions of the projections of the canonical vectors onto the hyperplane 
orthogonal to 1 = (1, . . . , 1). According to Section [3], for v £ U we have 



Using a more convenient but equivalent constant, we want to enumerate the local maxima 
of the problem 



A n = conv{ei, . . . , e n+ i} C W 



n+1 



G{v) (xp 3 (v). 



m&x-p 3 (v) 

S.t. V ■ V = 1 



(6) 



v ■ 1 







v £ M ; 



n+1 



The Lagrangian function is 



L(v, Ai, A 2 ) 




1(3 



The first order condition is V V L = 0, that is, 

v 2 = Ai + X 2 Vi for % — 1, . . . , n + 1. (7) 

Consider this system of equations on w for any fixed Ai, A 2 . Let f(x) = x 2 , g(x) = Ai + X 2 x. 
The first order condition says f{i'i) = g(vi), where / is convex and g is affine. That is, the 
ViS can take at most two different values. As our optimization problem ([6]) is symmetric 
under permutation of the coordinates, we conclude that, after putting the coordinates of a 
point v in non- increasing order, if v is a local maximum of ([6]), then v must be of the form 

v = (a,...,a,b,...,b), 

where a > > b and there are exactly a as and bs, for a, (3 E {1, . . . , n}. 

We will now study the second order necessary condition (SONC) to eliminate from the list 
of candidates all vectors with a > 1. It is easy to see that the surviving vectors are exactly 
the promised scaled projections of the canonical vectors. This vectors must all be local and 
global maxima: At least one of them must be a global maximum as we are maximizing a 
continuous function over a compact set and all of them have the same objective value so all 
of them are local and global maxima. 

The SONC at v asks for the Hessian of the Lagrangian to be negative semidefinite when 
restricted to the tangent space to the constraint set at v [El Section 11.5]. We compute the 
Hessian (recall that is the vector of the squared coordinates of v): 

V V L = v {2) - Ail - X 2 v 

V^L = 2diag«)-A 2 / 

where diag(t> ) is the (n + l)-by-(n + 1) matrix having the entries of v in the diagonal and 
elsewhere. 

A vector in the tangent space is any z G M n+1 such that z-t — 0, v -z — 0, and definiteness 
of the Hessian is determined by the sign of z T ^J 2 v Lz for any such z, where 

n+l 

^V^ = ^^ 2 (2^-A 2 ). 

i=l 

Suppose v is a critical point with a > 2. To see that such a v cannot be a local maximum, 
it is enough to show 2a > A 2 , as in that case we can take z — (1, —1, 0, . . . , 0) to make the 
second derivative of L positive in the direction z. 

In terms of a,(3,a,b, the constraints of ([6]) are aa + fib = 0, aa 2 + fib 2 = 1, and this 

implies a = \J a ^ +l ^ , b = — yj p The inner product between the first order condition 

([7]) and v implies A 2 = YL v f = aa3 + /^ 3 - ^ ^ s convenient to consider the change of variable 
7 = a/(n + 1), as now candidate critical points are parameterized by certain discrete values 
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of 7 in (0, 1). This gives — (1 — 7)(n + 1), a = (1 — 7)/(7(n + 1)) and 

A 2 = (n + 1) 

-(!-' 



7 



\ 3/2 

1 — 7 
7(n + 1) 



> 3/2 

7 



(l- 7 )(n + l) 

[(l-7) 2 -7 2 ] 



^(72 + 1)7(1-7) 

1 

= 1 - 2 7 
y/(n + 1)7(1 - 7) 



This implies 



2a - A 2 = ^_=i===[2(l - 7) - 1 + 2 7 ] 
^(71 + 1)7(1-7) 

1 



^(71 + 1)7(1-7)' 

In (0, 1), the function given by 7 1— > 2a — A 2 = , 1 = is convex and symmetric around 

y ("+ 1)7(1-7 ) 

1/2, where it attains its global minimum value, 2/\/n + 1, which is positive. □ 
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